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Abstract —Data encoded as symmetric positive definite 
(SPD) matrices frequently arise in many areas of computer 
vision and machine learning. While these matrices form an 
open subset of the Euclidean space of symmetric matrices, 
viewing them through the lens of non-Euclidean Riemannian 
geometry often turns out to he better suited in capturing 
several desirable data properties. However, formulating clas¬ 
sical machine learning algorithms within such a geometry is 
often non-trivial and computationally expensive. Inspired by 
the great success of dictionary learning and sparse coding for 
vector-valued data, our goal in this paper is to represent data 
in the form of SPD matrices as sparse conic combinations 
of SPD atoms from a learned dictionary via a Riemannian 
geometric approach. To that end, we formulate a novel 
Riemannian optimization objective for dictionary learning and 
sparse coding in which the representation loss is characterized 
via the affine invariant Riemannian metric. We also present a 
computationally simple algorithm for optimizing our model. 
Experiments on several computer vision datasets demonstrate 
superior classification and retrieval performance using our 
approach when compared to sparse coding via alternative 
non-Riemannian formulations. 

Index Terms —Dictionary learning. Sparse coding, Rieman¬ 
nian distance. Region covariances 

I. Introduction 

S YMMETRIC positive definite (SPD) matrices play an 
important role as data descriptors in several computer 
vision applications, for example in the form of region 
covariances IT]. Notable examples where such descriptors 
are used include object recognition E, human detection 
and tracking a, visual surveillance a, 3D object recog¬ 
nition 0, among others. Compared with popular vector 
space descriptors, such as bag-of-words, Fisher vectors, 
etc., the second-order structure offered by covariance ma¬ 
trices is particularly appealing. For instance, covariances 
conveniently fuse multiple features into a compact form 
independent of the number of data points. By choosing 
appropriate features, this fusion can be made invariant to 
affine distortions El, or robust to static image noise and 
illumination variations. Moreover, generating these descrip¬ 
tors is easy, for instance using integral image transforms 0. 

We focus on SPD matrices for dictionary learning with 
sparse coding (DFSC) - a powerful data representation tool 
in computer vision and machine learning 0 that enables 
state-of-the-art results for a variety of applications fS], 0. 
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Fig. 1. An illustration of our Riemannian dictionary learning and 
sparse coding. For the SPD manifold Ai and learned SPD basis 
matrices Bi on the manifold, our sparse coding objective seeks a 
non-negative sparse linear combination atBi of the Bi’s that 
is closest (in a geodesic sense) to the given input SPD matrix X. 


Given a training set, traditional (Euclidean) dictionary 
learning problem seeks an overcomplete set of basis vectors 
(the dictionary) that can be linearly combined to sparsely 
represent each input data point; finding a sparse repre¬ 
sentation given the dictionary is termed sparse coding. 
While sparse coding was originally conceived for Euclidean 
vectors, there have been recent extensions of the setup to 
other data geometries, such as histograms M, Grassman- 
nians ifTTIl . and SPD matrices ifT^ . ifTTll . ifTTl . ifTSl . The 
focus of this paper is on dictionary learning and sparse 
coding of SPD matrix data using a novel mathematical 
model inspired by the geometry of SPD matrices. 

SPD matrices are an open subset of the Euclidean space 
of symmetric matrices. This may lead one to believe that 
essentially treating them as vectors may suffice. However, 
there are several specific properties that applications us¬ 
ing SPD matrices demand. For example, in DT-MRI the 
semidefinite matrices are required to be at infinite distances 
from SPD matrices E). Using this and other geometrically 
inspired motivation a variety of non-Euclidean distance 
functions have been used for SPD matrices—see e.g., El, 
El, El- The most widely used amongst these is the affine 
invariant Riemannian metric El, El, the only intrinsic 
Riemannian distance that corresponds to a geodesic dis¬ 
tance on the manifold of SPD matrices. 
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In this paper, we study dictionary learning and sparse 
coding of SPD matrices in their natural Riemannian geom- 
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etxy. Compared to the Euclidean setup, their Riemannian 
geometry, however, poses some unique challenges: (i) the 
manifold defined by this metric is (negatively) curved 1^ , 
and thus the geodesics are no more straight-lines; and (ii) in 
contrast to the Euclidean DLSC formulations, the objective 
function motivated by the Riemannian distance is not 
convex in either its sparse coding part or in the dictionary 
learning part. We present some theoretical properties of 
our new DLSC formulation and mention a situation of 
purely theoretical interest where the formulation is convex. 
Eigure [T] conceptually characterizes our goal. 

The key contributions of this paper are as follows. 

• Formulation: We propose a new model to learn a dic¬ 
tionary of SPD atoms; each data point is represented 
as a nonnegative sparse linear combination of SPD 
atoms from this dictionary. The quality of the resulting 
representation is measured by the squared intrinsic 
Riemannian distance. 

• Optimization: The main challenge in using our for¬ 
mulation is its higher computational cost relative to 
Euclidean sparse coding. However, we describe a sim¬ 
ple and effective approach for optimizing our objective 
function. Specifically, we propose a dictionary learn¬ 
ing algorithm on SPD atoms via conjugate gradient 
descent on the product manifold generated by the SPD 
atoms in the dictionary. 

A forerunner to this paper appeared in ED. The current 
paper differs from our conference paper in the following 
major aspects: (i) we propose a novel dictionary learning 
formulation and an efficient solver for it; and (ii) extensive 
new experiments using our dictionary learning are also 
included and the entire experimental section re-evaluated 
under our new setup while also including other datasets 
and evaluation metrics. 

To set the stage for presenting our contributions, we first 
review key tools and metrics from Riemannian geometry 
that we use to develop our ideas. Then we survey some 
recent methods suggested for sparse coding using alterna¬ 
tive similarity metrics on SPD matrices. Throughout we 
work with real matrices; extension to Hermitian positive 
definite matrices is straightforward. The space of d x d 
SPD matrices is denoted as 5^, symmetric matrices by 
<5"^, and the space of (real) invertible matrices by GL((i). 
By Log(A'), for X G S_f_, we mean the principal matrix 
logarithm and log|X| denotes the scalar logarithm of the 
matrix determinant. 

II. Preliminaries 

We provide below a brief overview of the Riemannian 
geometry of SPD matrices. A manifold AI is a set of points 
endowed with a locally-Euclidean structure. A tangent 
vector at P € Ad is defined as the tangent to some curve 
in the manifold passing through P. A tangent space TpAi 
defines the union of all such tangent vectors to all possible 
such curves passing through P; the point P is termed 
the/oof of this tangent space. The dimensionality of TpXi 
is the same as that of the manifold. It can be shown that the 


tangent space is isomorphic to the Euclidean space Il22l : 
thus, they provide a locally-linear approximation to the 
manifold at its foot. 

A manifold becomes a Riemannian manifold if its tan¬ 
gent spaces are endowed with a smoothly varying inner 
product. The Euclidean space, endowed with the classical 
inner product defined by the trace function (i.e., for two 
points X,Y G S^,{X,Y) = Tr(Xy)), is a Riemannian 
manifold. Recall that an SPD matrix has the property that 
all its eigenvalues are real and positive, and it belongs to 
the interior of a convex self-dual cone. Since for dxd SPD 
matrices, this cone is a subset of the i(i((i-|-l)-dimensional 
Euclidean space of symmetric matrices, the set of SPD 
matrices naturally forms a Riemannian manifold under the 
trace metric. However, under this metric, the SPD manifold 
is not complete Q- This is because, the trace metric does not 
enclose all Cauchy sequences originating from the interior 
of the SPD cone ||2T1 . 

A possible remedy is to change the geometry of the 
manifold such that positive semi-definite matrices (which 
form the closure of SPD matrices for Cauchy sequences) 
are at an infinite distance to points in the interior of the SPD 
cone. This can be achieved by resorting to the the classi¬ 
cal log-barrier function g{P) = — logdet(P), popular in 
the optimization community in the context of interior point 
methods ll24l . The trace metric can be modified to the new 
geometry induced by the barrier function by incorporating 
the curvature through its Hessian operator at point P in the 
direction Z given by Hp{Z) = g"{P){Z) = P~^ZP~^. 
The Riemannian metric at P for two points Z\,Zi G TpXi 
is thus defined as: 

(Zi, Z2)p = {HpZu Zf) = Yi{p-^ZiP-^Z2). (1) 


There are two fundamental operations that one needs for 
computations on Riemannian manifolds: (i) the exponential 
map Expp : S‘^ —>■ S‘1', and (ii) the logarithmic map 
Logp = Exp/^ : where P G The former 

projects a symmetric point on the tangent space onto the 
manifold (termed a retraction), the latter does the reverse. 
Note that these maps depend on the point P at which the 
tangent spaces are computed. In our analysis, we will be 
measuring distances assuming P to be the identity matrix, 
I, in which case we will omit the subscript. 

Note that the Riemannian metric provides a measure for 
computing distances on the manifold. Given two points on 
the manifold, there are infinitely many paths connecting 
them, of which the shortest path is termed the geodesic. It 
can be shown that the SPD manifold under the Riemannian 
metric in ([T]) is non-positively curved (Hadamard manifold) 
and has a unique geodesic between every distinct pair of 
points ED [Chap. 12], ||26l[Chap. 6]. Eor X,Y GSf, there 
exists a closed form for this geodesic distance, given by 


dn{X,Y) 


Log(x-i/2rA:-i/2) 


( 2 ) 


space is a complete metric space if all Cauchy sequences are 
convergent within that space. 
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where Log is the matrix logarithm. It can be shown that this 
distance is invariant to affine transformations of the input 
matrices mi and thus is commonly referred to as the Affine 
invariant Riemannian metric. In the sequel, we will use Q 
to measure the distance between input SPD matrices and 
their sparse coded representations obtained by combining 
dictionary atoms. 

III. Related Work 

Dictionary learning and sparse coding of SPD matrices 
has received significant attention in the vision community 
due to the performance gains it brings to the respective 
applications Q, m, mi. Given a training dataset X, the 
DLSC problem seeks a dictionary B of basis atoms, such 
that each data point x G X can be approximated by a 
sparse linear combination of these atoms while minimizing 
a suitable loss function. Formally, the DLSC problem can 
be abstractly written as 

min 'Y' £(x, + ASp(6'x), (3) 

VxGA’ 

where the loss function C measures the approximation 
quality obtained by using the “code” 9x, while A regulates 
the impact of the sparsity penalty Sp(0a;). 

As alluded to earlier, the manifold geometry hinders 
a straightforward extension of classical DLSC techniques 
(such as II 27 I . lUl) to data points drawn from a manifold. 
Prior methods typically use surrogate similarity distances 
that bypass the need to operate within the intrinsic Rieman¬ 
nian geometry, e.g., (i) by adapting information geometric 
divergence measures such as the log-determinant diver¬ 
gence or the Stein divergence, (ii) by using extrinsic metrics 
such as the log-Euclidean metric, and (iii) by relying on 
the kernel trick to embed the SPD matrices into a suitable 
RKHS. We briefly review each of these schemes below. 
Statistical measures: In fill and ||29l, a DLSC framework 
is proposed based on the log-determinant divergence (Burg 
loss) to model the loss. For two SPD matrices X,Y G 
this divergence has the following form dB(X,Y) = 
Tt(XY~^) — log — d. Since this divergence acts 

as a base measure for the Wishart distribution a — a natural 
probability density on SPD matrices—a loss defined using 
it is statistically well-motivated. The sparse coding formu¬ 
lation using this loss reduces to a MaxDet optimization 
problem Cl and is solved using interior-point methods. 
Unsurprisingly, the method is often seen to be computa¬ 
tionally demanding even for moderately sized covariances 
(more than 5 x 5). Ignoring the specific manifold geometry 
of SPD matrices, one may directly extend the Euclidean 
DLSC schemes to the SPD setting. However, a naive use 
of Euclidean distance on SPD matrices is usually found 
inferior in performance. It is argued in m that approxi¬ 
mating an SPD matrix as sparse conic combinations of pos¬ 
itive semi-definite rank-one outer-products of the Euclidean 
dictionary matrix atoms leads to improved performance 
under the Euclidean loss. However, the resulting dictionary 
learning subproblem is nonconvex and the reconstruction 
quality is still measured using a Euclidean loss. Eurther, 


discarding the manifold geometry is often seen to showcase 
inferior results compared to competitive methods ED. 

Differential geometric schemes: Among the computation¬ 
ally efficient variants of Riemannian metrics, one of the 
most popular is the log-Euclidean metric die ifTbl defined 
for G 5^ as di,{X,Y) := ||Log(X) - Log(y)||p. 

The Log operator maps an SPD matrix isomorphically and 
diffeomorphically into the flat space of symmetric matrices; 
the distances in this space are Euclidean. DLSC with the 
squared log-Euclidean metric has been proposed in the 
past ll^ with promising results. A similar framework was 
suggested recently ED in which a local coordinate system 
is defined on the tangent space at the given data matrix. 
While, their formulation uses additional constraints that 
make their framework coordinate independent, their scheme 
restricts sparse coding to specific problem settings, such as 
an affine coordinate system. 

Kernelized Schemes: In ifT^ . a kernelized DLSC frame¬ 
work is presented for SPD matrices using the Stein diver¬ 
gence ifTTll for generating the underlying kernel function. 
Eor two SPD matrices X, Y, the Stein divergence is de¬ 
fined as ds{X,Y) = log I^(AT-I-y)I — ilog|Xy|. As 
this divergence is a statistically well-motivated similarity 
distance with strong connections to the natural Riemannian 
metric ( 03 . ifTTll l while being computationally superior, 
performances using this measure are expected to be simi¬ 
lar to those using the Riemannian metric 03 . However, 
this measure does not produce geodesically exponential 
kernels for all bandwidths HD making it less appealing 
theoretically. In ID, ifTSll kernels based on the log-Euclidean 
metric are proposed. A general DLSC setup is introduced 
for the more general class of Riemannian manifolds in 04l . 
The main goal of all these approaches is to linearize the 
curved manifold geometry by projecting the SPD matrices 
into an infinite dimensional Hilbert space as defined by the 
respective kernel. However, as recently shown theoretically 
in El, ES most of the curved Riemannian geometries 
(including the the span of SPD matrices) do not have such 
kernel maps, unless the geometry is already isometric to the 
Euclidean space (as in the case of the log-Euclidean metric). 
This result severely restricts the applicability of traditional 
kernel methods to popular Riemannian geometries (which 
are usually curved), thereby providing strong motivation to 
study the standard machine learning algorithms within their 
intrinsic geometry — as is done in the current paper. 

In light of the above summary, our scheme directly uses 
the affine invariant Riemannian metric to design our sparse 
reconstruction loss. To circumvent the computational diffi¬ 
culty we propose an efficient algorithm based on spectral 
projected gradients for sparse coding, while we use an adap¬ 
tation of the non-linear conjugate gradient on manifolds for 
dictionary learning. Our experiments demonstrate that our 
scheme is computationally efficient and provides state of 
the art results on several computer vision problems that 
use covariance matrices. 
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IV. Problem Formulation 

Let X = {Xi,X 2 , ■ ■ ■ , ^at} denote a set of N SPD data 
matrices, where each Xi G 5^. Let denote the product 
manifold obtained from the Cartesian product of n SPD 
manifolds, i.e., C Our goals are (i) 

to learn a third-order tensor (dictionary) B £ in which 
each slice represents an SPD dictionary atom Bj £ j = 
1 ,2 ,-•• ,n; and (ii) to approximate each Xi as a sparse 
conic combination of atoms in B; i.e., Xi ^ Ba^ where 
cti £ R" and Bu := for n-dimensional 

vector V. With this notation, our joint DLSC objective is 

1 ^ 

min Sp(aj)-f fl(B), (4) 

where Sp and 12 are regularizers on the coefficient vectors 
aj and the dictionary tensor respectively. 

Although formulation (|4]i may look complicated, it is a 
direct analogue of the vectorial DLSC setup to matrix data. 
For example, instead of learning a dictionary matrix in the 
vectorial DLSC, we learn a third-order tensor dictionary 
since our data X are now matrices. The need to constrain 
the sparse coefficients to the non-negative orthant is re¬ 
quired to make sure the linear combination of SPD atoms 
stays within the SPD cone. However, in contrast to the 
vectorial DLSC formulations for which the subproblems 
on the dictionary learning and sparse coding are convex 
separately, the problem in (01 is neither convex in itself, 
nor are its subproblems convex. 

From a practical point of view, this lack of convexity 
it is not a significant concern as all we need is a set 
of dictionary atoms which can sparse code the input. To 
this end, we propose below an alternating minimization 
(descent) scheme that alternates between locally solving the 
dictionary learning and sparse coding sub-problems, while 
keeping fixed the variables associated with the other. A full 
theoretical analysis of the convergence of this nonconvex 
problem is currently beyond the scope of this paper and of 
most versions of nonconvex analysis known to us. However, 
what makes the method interesting and worthy of future 
analysis is that empirically it converges quite rapidly as 
shown in Figure 5. 


A. Dictionary Learning Subproblem 

Assuming the coefficient vectors a available for all the 
data matrices, the subproblem for updating the dictionary 
atoms can be separated from (|4]i and written as: 


min 0(B) := 


1 

-^4(X„Ba,) + H(B), 

i=i 


1 

2 


N 

E||Log(^"MBa,)X-^) 


i=i 


Eh(B). 

F 


(5) 

(6) 


1) Regularizers: Before delving into algorithms for opti¬ 

mizing (|6]l, let us recall a few potential regularizers fl on the 
dictionary atoms, which are essential to avoid overfitting the 
dictionary to the data. For SPD matrices, we have several 
regularizers available, such as: (i) the largest eigenvalue 
regularizer H(B) = 11^*112’ deviation of the dic¬ 

tionary from the identity matrix H(B) = \\Bi —/||p, 
(iii) the Riemannian elasticity regularizer which mea¬ 
sures the Riemannian deformation of the dictionary from 
the identity matrix H(B) = ~ logWIlp = 

d-jiiBi,!)^, and (iv) the trace regularizer, i.e., fl(B) = 
Ab X]r=i Tr(i3i), for a regularization parameter Ab- In the 
sequel, we use the unit-trace regularizer as it is simpler and 
performs well empirically. 

2) Optimizing Dictionary Atoms: Among several 
first-order alternatives for optimizing over the SPD 
atoms (such as the steepest-descent, trust-region 
methods lEll, etc.), the Riemannian Conjugate Gradient 
(CG) method ll22lrChap.81. was found to be empirically 
more stable and faster. Below, we provide a short 
exposition of the CG method in the context of minimizing 
over B which belongs to an SPD product manifold. 

For an arbitrary non-linear function 9{x), x £ R", the 
CG method uses the following recurrence at step k + \ 

^k+l — 't'k ^k^kt ( 7 ) 

where the direction of descent is 

ik = - gradfljccfe) -f Pk^k-i, (8) 

with grad6{xk) defining gradient of 9 at Xk (■fo = 
-grad6>(a::o)), and pk given by 

_ (grad6>(a:fc))'^(grad6l(a:fc) - grad6>(a:fc-i)) 
grad 6i(a:fe_i)^ grad 6>(xfc_i) 

The step-size yk in © is usually found via an efficient 
line-search method ll3^ . It can be shown that [Sec. 1.6] 
when 9 is quadratic with a Hessian Q, the directions 
generated by (0 will be Q-conjugate to previous directions 
of descent • • • , thereby © providing the exact 

minimizer of / in fewer than d iterations (d is the manifold 
dimension). 

For B £ and referring back to ©, the recurrence 
in © will use the Riemannian retraction ||22l [Chap.4] 
and the gradient grad0(Bfe) will assume the Riemannian 
gradient (here we use B^ to represent the dictionary tensor 
at the A:-th iteration). This leads to an important issue: the 
gradients grad0(Bfc) and grad0(Bfc_i) belong to two 
different tangent spaces T-b^AA and Tbj,_iA 4 respectively, 
and thus cannot be combined as in ©. Thus, follow¬ 
ing II 22 I [Chapter 8] we resort to vector transport - a scheme 
to transport a tangent vector at P £ A4 to a point Expp(S') 
where S £ TpM. and Exp is the exponential map. The 
resulting formula for the direction update becomes 

Cb;, = - grad 0(Bfe) -f (Cfe-i), (10) 

where 

_ (grad 0(Bfc), grad 0(Bfc) - (grad 0(Bfe_i))) 

(grad 0(Bfc_i), grad 0(Bj;_i)) 


( 11 ) 
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Here for Z\,Zi G TpM., the map ‘Zzx{Z 2 ) defines the 
vector transport given by: 


^zAZ2) 


— expp(Zi + tZ2) 


t=o 


( 12 ) 


The remaining technical detail is the expression for the 
Riemannian gradient grad0(B), which we derive next. 

3) Riemannian Gradient: The following lemma con¬ 
nects the Riemannian gradient to the Euclidean gradient 
of 0(B) in (|6]l. 


Lemma 1. For a dictionary tensor B G let 0(B) 
be a differentiable function. Then the Riemannian gradient 
grad 0(B) satisfies the following condition: 


(grad0(B), Ob = (V0(B), 0/, VC € TpMt (13) 


where V0(B) is the Euclidean gradient of 0(B). The 
Riemannian gradient for the i-th dictionary atom is given 
grad, 0(B) = B,Vb.0(B)S,. 

Proof: See (221 [Chap. 5]. The latter expression is 
obtained by substituting the inner product on the LHS 
of (fljl l by its definition in ([T]i. ■ 

We can derive the Euclidean gradient V0(B) as follows: 

let Sj = Xj ^ and Mj(B) = Ba^ = Yll=i Then, 

.AT n 

0(B) = - ^ Tr(Log(5,M0B)50') + ^ Tr(B,). 

j=l 

(14) 

The derivative of (fT4l t w.r.t. to atom Bi is; 

N 

^a}(5,Log(M0B))(M0B))-'5,) +Ab/. (15) 

i=i 


B. Sparse Coding Subproblem 

Referring back to dUl, let us now consider the sparse 
coding subproblem. Suppose we have a dictionary tensor 
B available. Eor a data matrix Xj G Sf our sparse coding 
objective is to solve 


min f(aj) := -d^ (Xi,Baj) + Sp(a0 

aj>0 Z 

» •^TL .1 1 ^ 

Log^ _^ ajX~^BjX~^ +Sp(aj), 


(16) 


where a® is the z-th dimension of aj and Sp is a spar¬ 
sity inducing function. Eor simplicity, we use the sparsity 
penalty Sp(a) = A||a||i, where A > 0 is a regularization 
parameter. Since we are working with a > 0, we replace 
this penalty by A^^a^, which is differentiable. 


The subproblem ([Tdl l measures reconstruction quality 
offered by a sparse non-negative linear combination of 
the atoms to a given input point X. It will turn out (see 
experiments in Section |V]i that the reconstructions obtained 
via this model actually lead to significant improvements in 
performance over sparse coding models that ignore the rich 
geometry of SPD matrices. But this gain comes at a price: 
model ( [T6] ) is a nontrivial to optimize; it remains difficult 
even if we take into account geodesic convexity of d-p. 


While in practice this nonconvexity does not seem to 
hurt our model, we show below a surprising but intuitive 
constraint under which Problem ( [Thb actually becomes 
convex. The following lemma will be useful later. 

Lemma 2. Let B, C, and X be fixed SPD matrices. Con¬ 
sider the function f(x) = dp{xB -f C,X). The derivative 
f'(x) is given by 

f'{x) = 2Tr{\og{S{xB + C)S)S-^{xB + C)-^BS), 

(17) 

where S = X~ 2 . 

Proof: Introduce the shorthand M(x) = xB-\-C, from 
definition (l2]i and using ||.^||p = Tr(Z^Z) we have 

f(x) = TT(log(SM(x)Sf log(SM(x)S)). 

The chain-rule of calculus then immediately yields 

f'(x) = 2Tr(log(SM(x)S)(SM(x)S)-^SM'(x)S), 

which is nothing but (ITtI i. ■ 

As a brief digression, let us mention below an interesting 
property of the sparse-coding problem. We do not exploit 
this property in our experiments, but highlight it here for 
its theoretical appeal. 

Theorem 3. The function 4>{a) := d^{J2iC(iBi, X) is 
convex on the set 

A := {a \ aiBi A X, and a > 0}. (18) 

Proof: See Appendix lAl ■ 

Let us intuitively describe what Theorem [3 is saying. 
While sparsely encoding data we are trying to find sparse 
coefficients cti,..., a„, such that in the ideal case we have 
'Yhi^^iBi = X. But in general this equality cannot be 
satisfied, and one only has UiBi k, X, and the quality 
of this approximation is measured using 4>{a) or some 
other desirable loss-function. The loss ^{a) from ( fTbl ) is 
nonconvex while convexity is a “unilateral” property—it 
lives in the world of inequalities rather than equalities 1 ^ . 
And it is known that SPD matrices in addition to forming a 
manifold also enjoy a rich conic geometry that is endowed 
with the Lowner partial order. Thus, instead of seeking 
arbitrary approximations atBi « X, if we limit our 
attention to those that underestimate X as in (ITsT l. we might 
benefit from the conic partial order. It is this intuition that 
Theorem [3] makes precise. 

1) Optimizing Sparse Codes: Writing M{ap) = apBp-\- 
J2i^p<^iBi and using Lemma |2] we obtain 

= Tr (\og{SM{ap)S) {SM{ap)Sy"SBpS'^ + A. 

(19) 

Computing ([T9] l for all a is the dominant cost in a gradient- 
based method for solving dUl. We present pseudocode 
(Alg.ID that efficiently implements the gradient for the first 
part of (IT^ . The total cost of Alg. [T|is 0{nd^) 0{df) — 

a naive implementation of (IT^ costs 0{nd^), which is 
substantially more expensive. 
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Input: Bi ,. 

., Bn , G , a > 0 

S G- 


T ^ S\og{SMS){MS)-^-, 

for i = 1 to 

n do 

1 gi^Tr[TBp)\ 

end 


return g 



Algorithm 1: Jifticient computation of gradients 


Alg.[I]in conjunction with a gradient-projection scheme 
essentially runs the iteration 

c^fc+i ^ fe = 0,l,..., (20) 

where 3^[-] denotes the projection operator defined as 

= a !-)• argmin^/i||a'— a|| 2 , s.t. a'€ .4. (21) 

Iteration (l20l i has three major computational costs: (i) 
the stepsize (ii) the gradient V(()(a^'); and (iii) the 
projection (EB. Aig. m shows how to efficiently obtain the 
gradient. The projection task (l2Tl i is a special least-squares 
(dual) semidefinite program (SDP), which can be solved 
using any SDP solver or by designing a specialized routine. 
To avoid the heavy computational burden imposed by an 
SDP, we drop the constraint a € A', this sacrifices convexity 
but makes the computation vastly easier, since with this 
change, we simply have ^[a] = max(0, a). 

In (l20l i. it only remains to specify how to obtain the 
stepsize rjk- There are several choices available in the 
nonlinear programming literature lIMl for choosing r]k, 
but most of them can be quite expensive. In our quest 
for an efficient sparse coding algorithm, we choose to 
avoid expensive line-search algorithms for selecting pfc and 
prefer to use the Barzilai-Borwein stepsizes iQi, which 
can be computed in closed form and lead to remarkable 
gains in performance Ho), ED. In particular, we use the 
Spectral Projected Gradient (SPG) method B2l by adapting 
a simplified implementation of ED- 

SPG runs iteration (l20l i using Barzilai-Borwein stepsizes 
with an occasional call to a nonmontone line-search strategy 
to ensure convergence of {a^}. Without the constraint 
a' G A, we cannot guarantee anything more than a sta¬ 
tionary point of Ell while if we were to use the additional 
constraint then we can even obtain global optimality for 
iterates generated by (l20l i. 

V. Experiments 

In this section, we provide experiments on simulated 
and real-world data demonstrating the effectiveness of our 
algorithm compared to the state-of-the-art DLSC methods 
on SPD valued data. First, we demonstrate results on simu¬ 
lated data analyzing the performance of our framework for 
various settings. This will precede experiments on standard 
benchmark datasets. 


A. Comparison Methods 

In the experiments to follow, we will denote dictionary 
learning and sparse coding algorithms by DL and SC 
respectively. We will compare our Riemannian (Riem) 
formulation against combinations of several state-of-the-art 
DLSC methods on SPD matrices, namely using (i) log- 
Euclidean (LE) metric for DLSC ll^ . (ii) Frobenius norm 
(Frob) which discards the manifold structure, (iii) kernel 
methods such as the Stein-Kernel [18] proposed in [12] 
and the log-Euclidean kernel [13]. 

B. Simulated Experiments 

In this subsection, we evaluate in a controlled setting, 
some of the properties of our Riemannian sparse coding 
scheme. For all our simulations, we used covariances 
generated from data vectors sampled from a zero-mean 
unit covariance normal distribution. For each covariance 
sample, the number of data vectors is chosen to be ten times 
its dimensionality. For fairness of the comparisons, we 
adjusted the regularization parameters of the sparse coding 
algorithms so that the codes generated are approximately 
10% sparse. The plots to follow show the performance 
averaged over 50 trials. Further, all the algorithms in this 
experiment used the SPG method to solve their respective 
formulations so that their performances are comparable. 
The intention of these timing comparisons is to empirically 
point out the relative computational complexity of our 
Riemannian scheme against the baselines rather than to 
show exact computational times. For example, for the 
comparisons against the method Frob-SC, one can vectorize 
the matrices and then use a vectorial sparse coding scheme. 
In that case, Frob-SC will be substantially faster, and 
incomparable to our scheme as it solves a different problem. 
In these experiments, we will be using the classification 
accuracy as the performance metric. Our implementations 
are in MATLAB and the timing comparisons used a single 
core Intel 3.6GHz CPU. 

1) Increasing Data Dimensionality: While DT-MRI ap¬ 
plications typically use small SPD matrices (3 x 3), the 
dimensionality is very diverse for other applications in 
computer vision. For example, Gabor covariances for face 
recognition uses about 40-dimensional SPD matrices B3l . 
while even larger covariance descriptors are becoming 
common El. The goal of this experiment is to analyze the 
scalability of our sparse coding setup against an increasing 
size of the data matrices. To this end, we fixed the number 
of dictionary atoms to be 200, while increased the matrix 
dimensionality from 3 to 100. Figure [2(a)| plots the time- 
taken by our method against the naive Frob-SC method 
(although it uses the SPG method for solution). The plot 
shows that the extra computations required by our Riem-SC 
is not substantial compared to Frob-SC. 

2) Increasing Dictionary Size: In this experiment, we 
compare the scalability of our method to work with larger 
dictionary tensors. To this end, we fixed the data dimen¬ 
sionality to 10, while increased the number of dictionary 
atoms from 20 to 1000. Figure [2(b)| plots the time-taken 
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against the dictionary size. As is expected, the sparse 
coding performance for all the kernelized schemes drops 
significantly for larger dictionary sizes, while our scheme 
performs fairly. 

3) Increasing Sparsity Regularization: In this experi¬ 
ment, we decided to evaluate the effect of the sparsity pro¬ 
moting regularization A in (fThl l. To this end, we generated 
a dictionary of 100 atoms from covariances of Gaussian 
random variables. Later, 1000 SPD matrices are produced 
using conic combinations of randomly selected atoms. We 
used an active size of 10 dictionary atoms for all the SPD 
matrices. After adding random SPD noise to each matrix, 
we used half of them for learning the dictionary, while the 
other half was used for evaluating the sparsity regulariza¬ 
tion. We increased A from 10“® to 10® at steps of 10. In 
Figure |3(a)| we plot the sparsity (i.e., number of non-zero 
coefficients/size of coefficients) for varying A. We see that 
while the lower values of A does not have much influence 
on sparsity, as A increases beyond a certain threshold, 
sparsity increases. A similar trend is seen for increasing 
data dimensionality. However, we And that the influence 
of A starts diminishing as the dimensionality increases. For 
example, sparsity plateaus after 3% for 5-dimensional data, 
while this happens at nearly 15% for 20-dimensional data. 
The plateauing of sparsity is not unexpected and is directly 
related to the Riemannian metric that we use - our loss 
will prevent all the sparse coefficients from going to zero 
simultaneously as in such a case the objective will tend to 
infinity. Further, as the matrix dimensionality increases, it is 
more likely that the data matrices become ill-conditioned. 
As a result, this plateau-ing happens much earlier than for 
better conditioned matrices (as in the case of 5-dimensional 
matrices in Figure |3(a)| i. 

In Figure |3(b)| we contrast the sparsity pattern produced 
by our Riemannian sparse coding (Riem-DL + Riem- 
SC) scheme against that of the traditional sparse coding 
objective using log-Euclidean sparse coding (LE-DL + LE- 
SC), for 20-dimensional SPD data. As is expected, the 
log-Euclidean DL follows the conventional convergence 
patterns in which sparsity goes to zero for larger values 
of the regularization. Since for larger regularizations, most 
of the coefficients in our Riem-SC have low values, we 
can easily discard them by thresholding. However, we 
believe this difference in the sparsity patterns needs to be 
accounted for when choosing the regularization parameters 
for promoting sparsity in our setup. 

4) Convergence for Increasing Dimensionality: In this 
experiment, we evaluate the convergence properties of our 
dictionary learning sub-problem based on the Riemannian 
conjugate gradient scheme. To this end, we used the same 
setup as in the last experiment using data generated by a 
pre-deflned dictionary, but of different dimensionality (G 
{3,5,10, 20}). In Eigure|5] we plot the dictionary learning 
objective against the iterations. As is expected, smaller data 
dimensionality shows faster convergence. That said, even 
20-dimensional data was found to converge in less than 50 
alternating iterations of the algorithm, which is remarkable. 


C. Experiments with Public Datasets 

Now let us evaluate the performance of our framework 
on computer vision datasets. We experimented on data 
available from four standard computer vision applications, 
namely (i) 3D object recognition on the RGBD objects 
dataset ||45]| . (ii) texture recognition on the standard Brodatz 
dataset ll46l . (iii) person re-identification on the ETHZ peo¬ 
ple dataset El, and (iv) face recognition on the Youtube 
faces dataset HSl . We describe these datasets below. 

1) Brodatz Texture: Texture recognition is one of the 
most successful applications of covariance descriptors 1491 . 
m . Eor this evaluation, we used the Brodatz texture 
datasell, from which we took 100 gray scale texture images, 
each of dimension 512 x 512. We extracted 32 x 32 patches 
from a dense grid without overlap thus generating 256 
texture patches per image, and totalling 25600 patches 
in our dataset. To generate covariance descriptors from 
each patch, we followed the traditional protocol, i.e., we 
extracted a 5-dimensional feature descriptor from each 
pixel location in each patch. The features are given by: 
Ftextures = [a;, U, F, abs{Ix), abs{Iy)]^, where the first two 
dimensions are the coordinates of a pixel from the top-left 
corner of a patch, the last three dimensions are the image 
intensity, and image gradients in the x and y directions 
respectively. Region covariances of size 5x5 are computed 
from all features in a patch. 

2 ) ETHZ Person Re-identification Dataset: Tracking and 
identifying people in severely dynamic environments from 
multiple cameras play an important role in visual surveil¬ 
lance. The visual appearances of people in such applications 
are often noisy, and low resolution. Eurther, the appearances 
undergo drastic variations with respect to their pose, scene 
illumination, and occlusions. Lately, covariance descriptors 
have been found to provide a robust setup for this task 
na, Eoi. In this experiment, we evaluate the performance 
of clustering people appearances on the benchmark ETHZ 
dataset m. This dataset consists of low-resolution images 
of tracked people from a real-world surveillance setup. The 
images are from 146 different individuals. There are about 
5-356 images per person. Sample images from this dataset 
are shown in Eigure|4] There are a total of 8580 images in 
this dataset. 

Our goal in this experiment is to evaluate the perfor¬ 
mance of our DLSC framework to learn generic dictionaries 
on covariance descriptors produced from this application. 
Note that some of the classes in this dataset does not have 
enough instances to learn a specific dictionary for them. 
Several types of features have been suggested in literature 
for generating covariances on this dataset that have shown 
varying degrees of success such as Gabor wavelet based 
features lISOll . color gradient based features lfT2l . etc. Rather 
than detailing the results on several feature combinations, 
we describe here the feature combination that worked 
the best in our experiments. Eor this purpose, we used 
a validation set of 500 covariances and 10 true clusters 
from this dataset. The performance was evaluated using 

^ http://www.ux.uis.no/~tranden/brodatz.html 
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(a) 

Fig. 2. Sparse coding time against (a) increasing matrix dimensionality 
iterations for all the algorithms. 



(b) 


(b) increasing number of dictionaiy atoms. We used a maximum of 100 



Fig. 3. Sparsity of coding against (a) increasing sparsity inducing regularization A for vaiious matrix dimensionality and (b) sparsity against lambda 
in comparison to that for log-Euclidean DL. 



(a) RGBD objects (b) Brodatz textures (c) ETHZ appeai'ances (d) Youtube faces 


Fig. 4. Montage of sample images from the four datasets used in our experiments. From top, samples from the RGBD object recognition dataset, 
Brodatz texture recognition, ETHZ people re-identification dataset, and Youtube face recognition dataset. 


the Log-Euclidean SC setup with a dictionary learning via 
Log-Euclidean K-Means. We used a combination of nine 
features for each image as described below: 

Fethz = [x Ir Ig h Yi abs(/a:) abs(/y) (22) 
abs(sin(0) + cos{9)) abs(iLy)], 

where x is the x-coordinate of a pixel location, Ir,Ig,Ib 
are the RGB color of a pixel, Yi is the pixel intensity 
in the YCbCr color space, Ix,Iy are the gray scale pixel 
gradients, and Hy is the y-gradient of pixel hue. Eurther, 
we also use the gradient angle 9 = tan~^{Iy/1^;) in our 
feature set. Each image is resized to a fixed size 300 x 100, 
and is divided into upper and lower parts. We compute 


two different region covariances for each part, which are 
combined as two block diagonal matrices to form a single 
covariance descriptor of size 18 x 18 for each appearance 
image. 

3} 3D Object Recognition Dataset: The goal of this ex¬ 
periment is to recognize objects in 3D point clouds. To this 
end, we used the public RGB-D Object dataset ll45l . which 
consists of about 300 objects belonging to 51 categories 
and spread in about 250K frames. We used approximately 
15K frames for our evaluation with approximately 250-350 
frames devoted to every object seen from three different 
view points (30, 45, and 60 degrees above the horizon). 
Eollowing the procedure suggested in ||53][Chap. 5], for 
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Fig. 6. Results of Nearest neighbor recall@K accuracy against increasing number of retrieved points (K). Comparisons of Riem-DL and Riem-SC 
against other DL leai'ning schemes based on clustering, while using our Riem-SC for sparse coding. 


Method 

Accuracy (%) 

Riem-DL + Riem-SC 

74.9 

LE-DL + LE-SC 

73.4 

Frob-DL + Frob SC 

23.5 

Kemelized LE DLSC fT3l+ l34l 

47.9 

Kernelized Stein DLSC 1121 -f 1341 

76.7 

Tensor DL-fSC |52| 

37.1 

GDL [B] 

Al.l 

Random DL + Riem-SC 

70.3 

TABLE I 


Brodatz texture dataset 

Method 

Accuracy (%) 

Riem-DL + Riem-SC 

80.5 

LE-DL -1- LE-SC 

80.0 

Frob-DL + Frob SC 

77.6 

Kemelized LE DLSC flSl-l- l34l 

86.6 

Kemelized Stein DLSC [12] [34] 

71.6 

Tensor DL-fSC |52| 

67.4 

GDL [B] 

71.0 

Random DL + Riem-SC 

54.6 


TABLE III 

ETHZ People dataset 


Method 

Accuracy (%) 

Riem-DL + Riem-SC 

80.0 

LE-DL -1- LE-SC 

80.5 

Frob-DL + Frob SC 

54.2 

Kemelized LE DLSC fBl-l- l34l 

86.0 

Kemelized Stein DLSC 1121-f 1341 

85.7 

Tensor DL-hSC (52) 

68.1 

GDL [B] 

43.0 

Random DL -F Riem-SC 

62 

TABLE 11 


RGBD OBJECTS 



Method 

Accuracy (%) 

Riem-DL -F Riem-SC 

92.4 

LE-DL -1- LE-SC 

82.6 

Frob-DL -F Frob SC 

82.9 

Kemelized LE DSC fBl-H (34) 

93.1 

Kemelized Stein DLSC 1121 -F 1341 

70.1 

GDL [B] 

92.0 

Random DL -F Riem-SC 

83.9 


TABLE IV 

Youtube Faces Dataset 


TABLES: Comparison of classification accuracy (using a linear SVM and one-against-all classification) with spai'se coding when the dictionaiy is 
learned using the respective DL method. The standard deviation was less than 5% for all methods. 
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Fig. 5. Dictionary Learning objective using Riemannian conjugate 
gradient descent against increasing number of iterations (alternating with 
the sparse coding sub-problem). We plot the convergence of the objective 
for various dimensionality of the data matrices. 

every frame, the object was segmented out and 18 dimen¬ 
sional feature vectors generated for every 3D point in the 
cloud (and thus 18 x 18 covariance descriptors); the features 
we used are as follows; 

^RGBD \_^ iV: ^ ^ ^ ^g: ^ ^xx : ^yyi ^xy: 

; ^y: ; ^y ; b'z] : ( 23 ) 

where the first three dimensions are the spatial coordinates, 
Im is the magnitude of the intensity gradient, ^’s represent 
gradients over the depth-maps, and z/ represents the surface 
normal at the given 3D point. 

4) Youtube Faces Dataset: In this experiment, we eval¬ 
uate the performance of the Riemannian DLSC setup to 
deal with a larger dataset of high-dimensional covariance 
descriptors for face recognition. To this end, we used the 
challenging Youtube faces dataset ll48l that consists of 3425 
short video clips of 1595 individuals, each clip containing 
between 48-6K frames. There are significant variations in 
head pose, context, etc. for each person across clips and 
our goal is to associate a face with its ground truth person 
label. We proceed by first cropping out face regions from 
the frames by applying a state-of-the-art face detector El, 
which results in approximately 196K face instances. As 
most of the faces within a clip do not have significant 
variations, we subsample this set randomly to generate our 
dataset of ^43K face patches. Next, we convolved the 
image with a filter bank of 40 Gabor filters with 5 scales 
and 8 different orientations to extract the facial features for 
each pixel, generating 40 x 40 covariances. 

D. Experimental Setup 

1) Evaluation Techniques: We evaluated our algorithms 
from two perspectives, namely (i) nearest neighbor (NN) 
retrieval against a gallery set via computing the Euclidean 
distances between sparse codes, and (ii) one-against-all 
classification using a linear SVM trained over the sparse 
codes. Given that computing the geodesic distance between 
SPD matrices is expensive, while the Frobenius distance 
between them results in poor accuracy, the goal of the first 
experiment is to evaluate the quality of sparse coding to 


approximate the input data in terms of codes that belong 
to the non-negative orthant of the Euclidean space - su¬ 
perior performance implying that the sparse codes provide 
efficient representations that could bypass the Riemannian 
geometry, and can enable other faster indexing schemes 
such as locality sensitive hashing for faster retrieval. Our 
second experiment evaluates the linearity of the space of 
sparse codes - note that they are much higher dimensional 
than the original covariances themselves and thus we expect 
them to be linearly separable in the sparse space. 

2) Evaluation Metric: For classification experiments, 
we use the one-against-all classification accuracy as the 
evaluation metric. For NN retrieval experiments, we use 
the Recall@K accuracy, which is defined as follows. Given 
a gallery X and a query set Q. Recall@K computes the 
average accuracy when retrieving K nearest neighbors from 
X for each instance in Q. Suppose stands for the set 
of ground truth class labels associated with the gth query, 
and if denotes the set of labels associated with the K 
neighbors found by some algorithm for the q queries, then 

3) Data Split: All the experiments used 5-fold cross- 
validation in which 80% of the datasets were used for 
training the dictionary, 10% for generating the gallery set 
or as training set for the linear SVM, and the rest as the 
test/query points. We evaluate three setups for generating 
the dictionaries, (i) using a proper dictionary learning strat¬ 
egy, and (ii) using clustering the training set via K-Means 
using the appropriate distance metric, and (iii) random 
sampling of the training set. 

4) Hyperparameters: The size of the dictionary was 
considered to be twice the number of classes in the 
respective dataset. This scheme was considered for all 
the comparison methods as well. We experimented with 
larger sizes, but found that performance generally almost 
saturates. This is perhaps because the datasets that we use 
already have a large number of classes, and thus the dictio¬ 
nary sizes generated using this heuristic makes them already 
significantly overcomplete. The other hyperparameter in our 
setup is the sparsity of the generated codes. As the different 
sparse coding methods (including ours and the methods that 
we compare to) have varied sensitivity to the regularization 
parameter, comparing all the methods to different sparsities 
turned out to be cumbersome. Thus, we decided to fix 
the sparsity of all methods to 10%-sparse and adjusted the 
regularization parameter for each method appropriately (on 
a small validation set separate from the training set). To 
this end, we used As =0.1 for textures, 10 for the ETHZ 
and RGBD datasets, and 100 for the faces dataset. For the 
faces dataset, we found it to be difficult to attain the desired 
sparsity by tuning the regularization parameter. Thus, we 
used a regularization of 100 and selected the top 10% sparse 
coefficients. 

5) Implementation Details: Our DFSC scheme was im¬ 
plemented in MATFAB. We used the ManOpt Riemannian 
geometric optimization toolbox ll55l for implementing the 
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Fig. 7. Results of k-nearest neighbor retrieval accuracy against an increasing number of retrieved points (K). Comparisons of Riem-DL and Riem-SC 
against the state of the art dictionary learning and sparse coding combinations. 


CG method in our DL sub-problem. As our problem is non- 
convex, we found that initializing the dictionary learning 
setup using K-Means clustering (using the Karcher mean 
algorithm lfT9l l demonstrate faster convergence. 

E. Results 

In this section, we compare the performance of our 
dictionary learning (Riem-DL) and sparse coding (Riem- 
SC) method against several state of the art DLSC schemes 
on the four datasets that we described above. Our choice 
of comparison methods include (i) Riemannian geometric 
methods such as log-Euclidean (LE-DL + LE-SC), (ii) Ker- 
nelized methods using the Stein kernel (Kernelized Stein) 
with the framework in ini, (iii) Kernelized Stein using 
the recent generic framework in 041 . (iv) Kernelized Log- 
Euclidean metric proposed in 03 but using the generic 
framework in 04ll . (v) Euclidean DLSC (Erob-DL + Erob- 
SC), (vi) using a dictionary generated by random sampling 
the dataset followed by sparse coding using our Riemannian 
method (Random-DL + Riem-SC), (vii) using the tensor 
dictionary learning sparse coding (TSC) setup 021 . and 
(viii) generalized dictionary learning QSl. In Eigure |7] we 
show the performance comparison for the task of K-NN 


where K is increased from one to 25. In Tables U [III [Till 
and II VI we show the performance for the one-against-all 
classification setup. 

An alternative to dictionary learning that is commonly 
adopted is to approximate the dictionary by using the 
centroids of clusters generated from a K-Means clustering 
of the dataset. Such a method is faster in comparison 
to a Riemannian DL, while also demonstrate reasonable 
performance ED, m. Thus, an important experiment 
with regard to learning the dictionary is to make sure 
using dictionary learning provides superior performance 
compared to this ad hoc setup. In Eigure |6] we plot the K- 
NN retrieval when we use a clustering scheme to generate 
the dictionary. In Tables [V] IVII IVIIl and IVIIII we show the 
same in a classification setup. 

F. Discussion of Results 

With regard to Eigure |7] we found that the performance 
of different methods is diverse across datasets. Eor ex¬ 
ample, the log-euclidean DLSC variant (LE-DLh-LE-SC) 
is generally seen to showcase good performance across 
datasets. The kernelized DLSC methods (Kernelized Stein 
and Kernelized LE) demonstrate superior performance on 
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Method 

Accuracy (%) 

Method 

Accuracy (%) 

Riem-DL + Riem-SC 

74.9 

Riem-DL -l- Riem-SC 

80.0 

LE-KMeans + Riem-SC 

70.0 

LE-KMeans + Riem-SC 

66.2 

Frob-KMeans -l- Riem-SC 

66.5 

Frob-KMeans + Riem-SC 

61.1 

Riem-KMeans +Riem-SC 

70.0 

Riem-KMeans +Riem-SC 

67.8 

TABLE V 

Brodatz texture dataset 

TABLE VI 

RGBD OBJECTS 


Method 

Accuracy (%) 

Method 

Accuracy (%) 

Riem-DL + Riem-SC 

80.5 

Riem-DL + Riem-SC 

92.4 

LE-KMeans + Riem-SC 

54.9 

LE-KMeans + Riem-SC 

87.1 

Frob-KMeans + Riem-SC 

55.5 

Frob-KMeans + Riem-SC 

88.7 

Riem-KMeans +Riem-SC 

57.5 

Riem-KMeans +Riem-SC 

85.8 

TABLE VII 


TABLE VIII 



ETHZ People dataset Youtube Faces Dataset 


TABLES: Compaiison of classification accuracy (using a linear SVM and one-against-all classification) using Riemannian sparse coding (Riem-SC) 
while the dictionary atoms are taken as the centroids of K-Means clusters. The standard deviation was less than 8% for all methods. 


almost all the datasets. The most surprising of the results 
that we found was for the Frob-DL case. It is gener¬ 
ally assumed that using Frobenius distance for comparing 
SPD matrices leads to poor accuracy, which we see in 
Figures |7(a)| |7(b)| and |7(c)| However, for the Youtube 
faces dataset, we found that the SPD matrices are poorly 
conditioned. As a result, taking the logarithm (as in the 
LE-DL scheme) of these matrices results in amplifying the 
influence of the smaller eigenvalues, which is essentially 
noise. When learning a dictionary, the atoms will be learned 
to reconstruct this noise against the signal, thus leading to 
inferior accuracy than for FrobDL or GDL which do not use 
matrix logarithm. We tried to circumvent this problem by 
tuning the small regularization that we add to the diagonal 
entries of these matrices, but that did not help. Other older 
DLSC methods such as TSC are seen to be less accurate 
compared to recent methods. We could not run the TSC 
method on the faces dataset as it was found to be too 
slow to sparse code the larger covariances. In comparison 
to all the above methods, Riem-DLH-Riem-SC was found 
to produce consistent, competitive (and sometimes better) 
performance, substantiating the usefulness of our proposed 
method. While running the experiments, we found that 
the initialization of our DL sub-problem (from K-Means) 
played an important role in achieving this superior per¬ 
formance. In Tables U [III [Till and |IV] we show the results 
for classification using the sparse codes. The kernelized LE 
seems to be significantly better in this setting. However, our 
Riemannian scheme does demonstrate promise by being the 
second best in most of the datasets. 

The usefulness of our Riem-DL is further evaluated 
against alternative DL schemes via clustering in Eigure |6l 
We see that learning the dictionary using Riem-DL demon¬ 
strates the best performance against the next best and 
efficient alternative of using the LE-KMeans that was done 
in ||2TI| . Using Erob-KMeans or using a random dictionary 
are generally seen to have inferior performance compared 
to other learning methods. In Tables [Vl IVII IVIIl and I VIIIl 
a similar trend is seen in the classification setting. 


VI. Conclusions 

In this paper, we proposed a novel setup for dictionary 
learning and sparse coding of data in the form of SPD 
matrices. In contrast to prior methods that use proxy 
similarity distance measures to define the sparse coding 
approximation loss, our formulation used a loss driven by 
the natural Riemannian metric (affine invariant Riemannian 
metric) on the SPD manifold. We proposed an efficient 
adaptation of the well-known non-linear conjugate gradient 
method for learning the dictionary in the product space of 
SPD manifolds and a fast algorithm for sparse coding based 
on the spectral projected gradient. Our experiments on 
simulated and several benchmark computer vision datasets 
demonstrated the superior performance of our method 
against prior works; especially our results showed that 
learning the dictionary using our scheme leads to signif¬ 
icantly better accuracy (in retrieval and classification) than 
other heuristic and approximate schemes to generate the 
dictionary. 

Appendix 

Here we prove Theorem [3 

Lemma 4. Let Z G GL((i) and let X G Sf. Then, 
Z^XZ G Si- 

Lemma 5. The Frechet derivative 4561 see e.g., Ch. 1] of 
the map X i—>■ log X at a point Z in the direction E is 
given by 

D\og{Z){E) = + (1 - p)I)-^E(pZ + (1 - 

(25) 

Proof: See e.g.. ll56l Ch. 111. ■ 

Corollary 6. Consider the map f{a) := a G K” ea 
Tr(log(S'M(Q;)5')i7), where M is a map from M" —>■ 
and H G S G Sf. Then, for 1 < p < n, we have 

= pTr[KiSS^SKeHW, 

where Kp := {pSM{a)S -f (1 — j3)I)~^. 
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Proof: Simply apply the chain-rule of calculus and use 
linearity of Tr(-). ■ 

Lemma 7. The Frechet derivative of the map X X~^ 
at a point Z in direction E is given by 

D{Z-^){E) = -Z-^EZ-^. (26) 


We are now ready to prove Theorem [3 

Thm. |5} We show that the Hessian ^ 0 on 

A. To ease presentation, we write S = M = 

M{a) = and let Dq denote the differential 

operator Da ■ Applying this operator to the first-derivative 
given by Lemma |2] (in Section lIV-Bl i. we obtain (using the 
product rule) the sum 

Tv{\Dq \og{SMS)]{SMS)-^SBpS) 

+ Ai{\og{SMS)Dq[{SXIS)-^SBpS]). 

We now treat these two terms individually. To the first we 
apply Corr. |6] So 

Tr (\Dq \og{SMS)]{SMS )"^ SBpS) 

= jlTv{KpSBqSKp{SMS)-^SBpS)dl3 
= jlTv{SBqSKp{SMS)-^SBpSKp-)dl3 
= fo (^/ 3 (P)> 

where the inner-product (•, •)^ is weighted by {SMS)~^ 
and the map 'l'/ 3 (p) := SBpSKp. We find a similar inner- 
product representation for the second term too. Starting 
with Lemma |7] and simplifying, we obtain 

Tv{\og{SMS)Dq[{SMS)-^SBpS]) 

= -Ai{\og{SMS){SMS)-^SBqM-^BpS) 

= Tv{-S\og{SMS)S-^M-^BqM-^Bp) 

= Tv{M-^Bp[-S\og{SMS)S-^]Xr^Bq). 

By assumption UiBi = M < X, which implies 
SMS A I. Since log(-) is operator monotone 1^ . it 
follows that log(S'MS') A 0; an application of Lemma |4] 
then yields S'log(S'MS')S'“^ ^ 0. Thus, we obtain the 
weighted inner-product 


TT{M-^Bp[-S\og{SMS)S-^]M-^Bq) 

= (M-^Bp, M-^Bq)^ , 

where L = [—Slog(SMS)S“^] ^ 0, whereby (•, •)^ is a 
valid inner-product. 

Thus, the second partial derivatives of f may be ulti¬ 
mately written as 


d'^(j){a) 

dapdaq 


{T{Bq), T{Bp )), 


for some map L and some corresponding inner-product 
(the map and the inner-product are defined by our analysis 
above). Thus, we have established that the Hessian is a 
Gram matrix, which shows it is semidefinite. Moreover, if 
the Bi are different (1 < i < n), then the Hessian is strictly 
positive definite. ■ 
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